Diverse Topic Phrase Extraction from Text Collection

Authors

  • Jilin Chen
  • Benyu Zhang
  • Dou Shen
  • Qiang Yang
  • Zheng Chen
  • Qiansheng Cheng
Abstract

Keyword extraction is an efficient approach to managing the explosion of online text on the Web. Traditionally, an abstraction of the online text is constructed through keywords, which are extracted according to a certain importance measure. One such measure is their occurrence frequency. However, previous work has not considered another important factor: the diversity of the keywords. As a result, the extracted keywords tend to cluster around one hot topic in the corpus while failing to cover other important topics. In this paper, we propose new algorithms to alleviate the disadvantages of these traditional methods for keyword extraction. First, we propose to extract key phrases instead of keywords, because phrases can effectively reduce the ambiguity of single words. Second, by leveraging latent semantic analysis, we can learn the related topics for each phrase as well as the distance among the phrases, so that the extracted phrases are able to cover more topics. To demonstrate the performance of our method, we conducted experiments on two open datasets: 20 Newsgroups and Reuters-21578. We design three novel evaluation metrics, based on which both qualitative and quantitative analyses show that our proposed algorithm can significantly improve key phrase extraction performance.
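The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea of trading frequency against diversity, not the authors' algorithm. It assumes each candidate phrase is represented by a vector of per-document counts (a stand-in for the latent semantic analysis projection mentioned above), and it greedily picks phrases that are frequent yet dissimilar to those already chosen. The function names, the `trade_off` parameter, and the toy data are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_by_frequency(phrase_vectors, k):
    """Baseline: the k phrases with the highest total count."""
    ranked = sorted(phrase_vectors, key=lambda p: (-sum(phrase_vectors[p]), p))
    return ranked[:k]


def select_diverse_phrases(phrase_vectors, k, trade_off=0.5):
    """Greedy selection: reward frequency, penalize similarity to earlier picks.

    Assumption: raw phrase-by-document counts replace the LSA topic space.
    """
    if not phrase_vectors:
        return []
    totals = {p: sum(vec) for p, vec in phrase_vectors.items()}
    max_total = max(totals.values()) or 1
    selected = []
    candidates = set(phrase_vectors)

    def score(phrase):
        importance = totals[phrase] / max_total
        redundancy = max(
            (cosine(phrase_vectors[phrase], phrase_vectors[s]) for s in selected),
            default=0.0,
        )
        return trade_off * importance - (1.0 - trade_off) * redundancy

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy corpus: counts of each phrase in four documents.
vectors = {
    "graphics card": [5, 4, 0, 0],
    "video driver": [4, 4, 0, 0],
    "space shuttle": [0, 0, 3, 3],
}

print(select_by_frequency(vectors, 2))     # ['graphics card', 'video driver']
print(select_diverse_phrases(vectors, 2))  # ['graphics card', 'space shuttle']
```

In this toy example the frequency baseline returns two phrases from the same topic, while the diversity-aware variant swaps the second one for a phrase from a different topic, which is the crowding effect the abstract describes.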

Similar resources

Acquiring Topic Features to improve Event Extraction: in Pre-selected and Balanced Collections

Event extraction is a particularly challenging type of information extraction (IE) that may require inferences from the whole article. However, most current event extraction systems rely on local information at the phrase or sentence level, and do not consider the article as a whole, thus limiting extraction performance. Moreover, most annotated corpora are artificially enriched to include enou...

Accurate Keyphrase Extraction from Scientific Papers by Mining Linguistic Information

In this paper we investigate the impact of candidate terms filtering using linguistic information on the accuracy of automatic keyphrase extraction from scientific papers. According to linguistic knowledge, the noun phrases are most likely to be keyphrases. However the definition of a noun phrase can vary from a system to another. We have identified five POS tag sequence definitions of a noun p...

A survey on phrase structure learning methods for text classification

Text classification is a task of automatic classification of text into one of the predefined categories. The problem of text classification has been widely studied in different communities like natural language processing, data mining and information retrieval. Text classification is an important constituent in many information management tasks like topic identification, spam filtering, email r...

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates automatic keyword extraction from the tables of contents of Persian e-books in the field of science using LDA topic modeling, evaluating the similarity of the extracted keywords with a gold standard and users' views of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

Constructing Knowledge Maps of A Manager's Managerial Logic by A Text Mining Approach

The objective of this research is to represent the managerial logic of Mr. Yung-Ching Wang, the Chairman of Formosa Plastics Group (also known as the “God of Business” in Taiwan) through the construction of knowledge maps using a text-mining approach, including automatic key phrase extraction, term identification, document vector modeling, and a clustering method named growing hierarchical self...

Publication date: 2005